 reward system


ReDit: Reward Dithering for Improved LLM Policy Optimization

Neural Information Processing Systems

DeepSeek-R1 has successfully enhanced Large Language Model (LLM) reasoning capabilities through its rule-based reward system. While it's a "perfect" reward system that effectively mitigates reward hacking, such reward functions are often discrete. Our experimental observations suggest that discrete rewards can lead to gradient anomalies, unstable optimization, and slow convergence. To address this issue, we propose ReDit (Reward Dithering), a method that dithers the discrete reward signal by adding simple random noise. With this perturbed reward, exploratory gradients are continuously provided throughout the learning process, enabling smoother gradient updates and accelerating convergence.
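The dithering idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only "simple random noise", so the zero-mean Gaussian noise, the `sigma` value, the function names, and the group-standardized advantage (in the style of GRPO) are all assumptions made for illustration.

```python
import random
import statistics


def dither_rewards(rewards, sigma=0.05, rng=None):
    """Add zero-mean Gaussian noise to each discrete reward.

    Gaussian noise and sigma=0.05 are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    return [r + rng.gauss(0.0, sigma) for r in rewards]


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    This GRPO-style normalization is assumed here; the abstract
    does not name the policy-optimization algorithm.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers that all failed a rule-based check: every reward is 0.
discrete = [0.0, 0.0, 0.0, 0.0]

# Identical rewards give zero advantages, so the policy gradient vanishes.
print(group_advantages(discrete))  # [0.0, 0.0, 0.0, 0.0]

# After dithering, the rewards differ slightly and the advantages are non-zero.
noisy = dither_rewards(discrete, sigma=0.05, rng=random.Random(0))
print(group_advantages(noisy))
```

In this toy setting the undithered group carries no learning signal at all, while the dithered group always yields some gradient, which is one way to read the abstract's claim about continuously provided exploratory gradients.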


Hamsters run on wheels for a surprisingly joyful reason

Popular Science

Even wild animals enjoy a good wheel. Turns out, that midnight "workout" might not be boredom or restlessness after all.


General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks

arXiv.org Artificial Intelligence

Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.

I. INTRODUCTION

Many real-world robotics tasks, such as sorting objects in e-commerce warehouses or visual navigation to a target site, involve solving a multi-goal problem. These tasks require the agent to act in a particular way among numerous options to achieve various desired outcomes.


ProRe: A Proactive Reward System for GUI Agents via Reasoner-Actor Collaboration

arXiv.org Artificial Intelligence

Reward is critical to the evaluation and training of large language models (LLMs). However, existing rule-based or model-based reward methods struggle to generalize to GUI agents, where access to ground-truth trajectories or application databases is often unavailable, and static trajectory-based LLM-as-a-Judge approaches suffer from limited accuracy. To address these challenges, we propose ProRe, a proactive reward system that leverages a general-purpose reasoner and domain-specific evaluator agents (actors). The reasoner schedules targeted state probing tasks, which the evaluator agents then execute by actively interacting with the environment to collect additional observations. This enables the reasoner to assign more accurate and verifiable rewards to GUI agents. Empirical results on over 3K trajectories demonstrate that ProRe improves reward accuracy and F1 score by up to 5.3% and 19.4%, respectively. Furthermore, integrating ProRe with state-of-the-art policy agents yields a success rate improvement of up to 22.4%.


VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

arXiv.org Artificial Intelligence

Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.


'Don't ask what AI can do for us, ask what it is doing to us': are ChatGPT and co harming human intelligence?

The Guardian

Imagine for a moment you are a child in 1941, sitting the common entrance exam for public schools with nothing but a pencil and paper. You read the following: "Write, for no more than a quarter of an hour, about a British author." Today, most of us wouldn't need 15 minutes to ponder such a question. We'd get the answer instantly by turning to AI tools such as Google Gemini, ChatGPT or Siri. Offloading cognitive effort to artificial intelligence has become second nature, but with mounting evidence that human intelligence is declining, some experts fear this impulse is driving the trend.


All games with loot boxes will be rated M or higher in Australia

PCWorld

Loot boxes in video games and mobile games have become less of a flashpoint for controversy, but a few years ago they were a major target of ire for both gamers and regulators. The wheels of justice (or at least of legislation) turn slowly, but they do turn, and Australia is making a big move in this sector. Starting this Sunday, any game sold in Australia with loot boxes will be rated either M (Mature) or R 18+ (Restricted). For the uninitiated, loot boxes are essentially digital blind boxes. Gamers buy a loot box (or several) in the hopes of finding rare items, weapons, or character outfits. But actually getting what you want is pure chance… and chance that's artificially slimmed down to an incredible longshot for the most rare and desirable items.


Evolution of Rewards for Food and Motor Action by Simulating Birth and Death

arXiv.org Artificial Intelligence

The reward system is one of the fundamental drivers of animal behaviors and is critical for survival and reproduction. Despite its importance, the problem of how the reward system has evolved is underexplored. In this paper, we try to replicate the evolution of biologically plausible reward functions and investigate how environmental conditions affect evolved rewards' shape. For this purpose, we developed a population-based decentralized evolutionary simulation framework, where agents maintain their energy level to live longer and produce more children. Each agent inherits its reward function from its parent subject to mutation and learns to get rewards via reinforcement learning throughout its lifetime. Our results show that biologically reasonable positive rewards for food acquisition and negative rewards for motor action can evolve from randomly initialized ones. However, we also find that the rewards for motor action diverge into two modes: largely positive and slightly negative. The emergence of positive motor action rewards is surprising because it can make agents too active and inefficient in foraging. In environments with poor and poisonous foods, the evolution of rewards for less important foods tends to be unstable, while rewards for normal foods are still stable. These results demonstrate the usefulness of our simulation environment and energy-dependent birth and death model for further studies of the origin of reward systems.


Planning the path with Reinforcement Learning: Optimal Robot Motion Planning in RoboCup Small Size League Environments

arXiv.org Artificial Intelligence

This work investigates the potential of Reinforcement Learning (RL) to tackle robot motion planning challenges in the dynamic RoboCup Small Size League (SSL). Using a heuristic control approach, we evaluate RL's effectiveness in obstacle-free and single-obstacle path-planning environments. Ablation studies reveal significant performance improvements. Our method achieved a 60% time gain in obstacle-free environments compared to baseline algorithms. Additionally, our findings demonstrated dynamic obstacle avoidance capabilities, adeptly navigating around moving blocks. These findings highlight the potential of RL to enhance robot motion planning in the challenging and unpredictable SSL environment.


Game-Theoretical Analysis of Reviewer Rewards in Peer-Review Journal Systems: Analysis and Experimental Evaluation using Deep Reinforcement Learning

arXiv.org Artificial Intelligence

In this paper, we navigate the intricate domain of reviewer rewards in open-access academic publishing, leveraging the precision of mathematics and the strategic acumen of game theory. We conceptualize the prevailing voucher-based reviewer reward system as a two-player game, subsequently identifying potential shortcomings that may incline reviewers towards binary decisions. To address this issue, we propose and mathematically formalize an alternative reward system with the objective of mitigating this bias and promoting more comprehensive reviews. We engage in a detailed investigation of the properties and outcomes of both systems, employing rigorous game-theoretical analysis and deep reinforcement learning simulations. Our results underscore a noteworthy divergence between the two systems, with our proposed system demonstrating a more balanced decision distribution and enhanced stability. This research not only augments the mathematical understanding of reviewer reward systems, but it also provides valuable insights for the formulation of policies within journal review systems. Our contribution to the mathematical community lies in providing a game-theoretical perspective to a real-world problem and in the application of deep reinforcement learning to simulate and understand this complex system.